Table of contents:

  • Part 02 - Data Analysis & Visualization
  • Part 03 - Machine Learning
  • Part 04 - Discussion & Contribution

Final Project | Explainer Notebook.

**02806 Social data analysis and visualization**

**May 2021**

**Data-sets Reference: Motor-Vehicle-Collisions[link](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95), Weather-Data[link](https://www.ncdc.noaa.gov/cdo-web/search), Speed-Limit-Data[link](https://data.cityofnewyork.us/Transportation/VZV_Speed-Limits/7n5j-865y)**


**Please note!** If you are using Jupyter to display this ".ipynb" file, you might need to make it Trusted in order to let Jupyter render the plots.

Part 01 - Data Preprocessing


Import needed libraries:



Load data:



Getting to know the Dataset:


Let's start by getting to know the dataset. In this section we introduce a function to track the reduction in data during the preparation and cleaning.

The reader gets familiar with the whole process, starting from viewing the data and the column types, through the different ways of tracking the number of missing values.

Below we see the NaN values per column. The NYC collision dataset that we use logs the 3rd, 4th and 5th factors in a collision by recording the vehicle types and the corresponding contributing factors, but there is very rarely more than a second contributing factor.

We realise that there are many missing values there because a collision is normally limited to two parties, the two cars in the crash.

These values will be handled later.
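The bookkeeping described above can be sketched as follows; the tiny DataFrame and the reduction helper are illustrative stand-ins, not the actual notebook code:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the full MVC dataset (not the real data).
df = pd.DataFrame({
    "CRASH DATE": ["2019-01-01", "2019-01-02", "2019-01-03"],
    "CONTRIBUTING FACTOR VEHICLE 3": [np.nan, np.nan, "unsafe speed"],
})

def reduction_pct(n_before, n_after):
    """Percentage of rows lost between two cleaning steps."""
    return round(100 * (n_before - n_after) / n_before, 1)

# NaN count per column -- the table discussed in the text.
nan_per_column = df.isna().sum()
```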

Below are the attributes with data filled in. CRASH DATE and CRASH TIME are the first things police officers record, as are the numbers of people killed or injured. We see some missing values in the STREET NAME and BOROUGH attributes, and, as mentioned above, the 3rd, 4th and 5th factors are rarely filled in.

Below, we spot some 0's in LATITUDE and LONGITUDE that don't make sense. These values will be handled later on.

Sometimes NaN values can come in the form of empty strings. We check this possibility in case we spot some extra missing values. There are none.
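The empty-string check can be sketched like this (toy data; the real check runs over every column of the full dataset):

```python
import pandas as pd

# Toy frame: one BOROUGH entry is an empty string rather than NaN.
df = pd.DataFrame({
    "BOROUGH": ["BROOKLYN", "", "QUEENS"],
    "ON STREET NAME": ["broadway", "atlantic ave", "main st"],
})

# Per column, count cells that are empty (or whitespace-only) strings.
empty_counts = df.apply(lambda col: col.astype(str).str.strip().eq("").sum())
```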


Data Cleaning:


After this initial check on the data, we are proceeding into the Data Cleaning.

In this section we finalize the New York City Collisions dataset from 2013-2020 in order to merge (in the next section) the weather and speed limit features.

The cleaning steps consist of dropping unneeded features, each with the corresponding justification.

Drop unneeded features:

In this section we drop COLLISION_ID since it's not informative. We drop LOCATION since we have LATITUDE and LONGITUDE. We drop the PEDESTRIANS, CYCLISTS and MOTORIST features since we have the NUMBER OF PERSONS INJURED and NUMBER OF PERSONS KILLED features, and finally we drop the 3rd, 4th and 5th collision factors (as discussed above).

Finally, we check the amount of data dropped overall. We see that we drop only 0.3% of the total data. That is a very satisfying result overall!
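The drop can be sketched as follows, assuming the column names discussed above (toy frame with columns only):

```python
import pandas as pd

# Toy frame with the relevant column names only (values omitted).
df = pd.DataFrame(columns=(
    ["COLLISION_ID", "LOCATION", "LATITUDE", "LONGITUDE"]
    + [f"VEHICLE TYPE CODE {i}" for i in (3, 4, 5)]
    + [f"CONTRIBUTING FACTOR VEHICLE {i}" for i in (3, 4, 5)]
))

drop_cols = (
    ["COLLISION_ID", "LOCATION"]                               # redundant ids
    + [f"VEHICLE TYPE CODE {i}" for i in (3, 4, 5)]            # rarely filled
    + [f"CONTRIBUTING FACTOR VEHICLE {i}" for i in (3, 4, 5)]  # 3rd-5th factors
)
df = df.drop(columns=drop_cols)
```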


Missing Data:


In this section we drop all the NaN values on the attributes ON STREET NAME, LATITUDE, LONGITUDE, NUMBER OF PERSONS KILLED and NUMBER OF PERSONS INJURED. We also drop the accidents that have LATITUDE and LONGITUDE == 0 because the location is not recorded.

Finally, we see that we lose 29% of the data, but the number of rows is still more than sufficient for a representative dataset.
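The NaN and zero-coordinate filtering can be sketched like this (toy rows for illustration):

```python
import pandas as pd

# Toy rows: one missing street name, one zero coordinate pair.
df = pd.DataFrame({
    "ON STREET NAME": ["broadway", None, "5 avenue"],
    "LATITUDE": [40.7, 40.6, 0.0],
    "LONGITUDE": [-73.9, -74.0, 0.0],
    "NUMBER OF PERSONS INJURED": [1, 0, 2],
    "NUMBER OF PERSONS KILLED": [0, 0, 0],
})

required = ["ON STREET NAME", "LATITUDE", "LONGITUDE",
            "NUMBER OF PERSONS INJURED", "NUMBER OF PERSONS KILLED"]
df = df.dropna(subset=required)
# Coordinates recorded as 0 mean the location was not captured.
df = df[(df["LATITUDE"] != 0) & (df["LONGITUDE"] != 0)]
```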


Feature Preparation:


In this section we clean the categorical values of important attributes like VEHICLE TYPES and CONTRIBUTING FACTORS. We also clean the ZIP column. The justification is described below.

After that, we create new features: most importantly the Response attribute, which indicates whether victims of the collision got injured or died (in the specific car accident), and time-reference features like the Year of the accident as well as the Month, Day, Hour, Day of week and Minute.

Prepare Vehicle types:

Our goal is to prepare the vehicle types without introducing any bias into the data. That means the scope of this preparation doesn't involve hand-merging subcategories beyond fixing typos; we then select the types that account for 95% of the MVC vehicle-type frequency.

The methodology is as follows:

  1. Select the Vehicle Types that have more than 50 motor vehicle collision (MVC) occurrences
  2. Replace typos (with no subcategory merging, in order to avoid bias)
  3. Consider only the vehicle types that together cover 95% of MVC occurrences
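The cutoff logic above can be sketched as follows (toy counts; the 50-occurrence threshold is scaled down to 5 to fit the toy data):

```python
import pandas as pd

# Toy vehicle-type column standing in for VEHICLE TYPE CODE 1.
s = pd.Series(["sedan"] * 60 + ["taxi"] * 30 + ["van"] * 8 + ["ambulance"] * 2)

counts = s.value_counts()
counts = counts[counts > 5]                 # step 1: drop rare types

# Step 3: keep the most frequent types that together cover 95% of MVCs.
cum_share = counts.cumsum() / counts.sum()
focus_types = cum_share[cum_share <= 0.95].index.tolist()
```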

We see that the final vehicle type 1 list involves the following (11) vehicles: 'sport utility vehicle', 'sedan', 'passenger vehicle', 'taxi', 'pick-up truck', 'van', 'bus', 'unknown', 'other', 'box truck', 'small com veh(4 tires)'.

We do the same for the vehicle type 2 and we end up with the following vehicles: 'sport utility vehicle', 'unknown', 'sedan', 'passenger vehicle', 'taxi', 'bike', 'pick-up truck', 'van', 'bus', 'other', 'box truck'.



In the end, we combine the two vehicle-type categorical values and we end up losing only 5% of the data. This is very welcome.

Prepare Vehicle type 1:

Prepare Vehicle type 2:

Slice Focus Vehicle Types (covers more than 95 % of MVC occurrences)

Prepare Contributing Factors:

Our goal for the contributing factors remains the same as for the vehicle types: avoid introducing any bias into the data. That means the scope of this preparation doesn't involve hand-merging subcategories beyond fixing typos; we then select the factors that account for 95% of the MVC contributing-factor frequency.

The methodology is as follows:

  1. Select the contributing factors that have more than 50 MVC occurrences
  2. Replace typos (with no subcategory merging, in order to avoid bias)
  3. Consider only the contributing factors that together cover 95% of MVC occurrences

We see that the final contributing factors type 1 involve the following (20) factors: 'unspecified', 'driver inattention/distraction', 'failure to yield right-of-way', 'following too closely', 'passing or lane usage improper', 'backing unsafely', 'other vehicular', 'turning improperly', 'fatigued/drowsy', 'unsafe lane changing', 'traffic control disregarded', 'driver inexperience', 'lost consciousness', 'reaction to uninvolved vehicle', 'unsafe speed', 'pavement slippery', 'prescription medication', 'alcohol involvement', 'physical disability', 'outside car distraction'.

We do the same for contributing factor 2 and we end up with the following (7) factors: 'unspecified', 'unknown', 'driver inattention/distraction', 'other vehicular', 'passing or lane usage improper', 'failure to yield right-of-way', 'following too closely'



In the end, we combine the two contributing-factor columns and we end up losing only 4.7% of the data. This is very welcome.

Prepare Contributing Factor 1:

Prepare Contributing Factor 2:

Slice Focus Factors Type (covers more than 95 % of MVC occurrences)

Prepare Zip Features:

Here we replace the unspecified ZIP CODE strings with NaN and change the column type to float64.
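A minimal sketch of this replacement, assuming the unspecified entries are blank/whitespace strings:

```python
import numpy as np
import pandas as pd

# Toy ZIP column where the unspecified entry is a blank string.
zips = pd.Series(["11201", "     ", "10001"])
# Blank strings become NaN, then the column is cast to float64.
zips = zips.replace(r"^\s*$", np.nan, regex=True).astype("float64")
```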

Extract new features:

Here we add two types of features:

  1. As discussed above, we add the most important feature, called Response. It indicates, as described at the beginning of the section, whether we have an injured or dead victim.

  2. Other time-related features like Year, Month, Day, Hour, etc.
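Both feature types can be sketched like this, assuming CRASH DATE and CRASH TIME are parseable strings (toy rows; the Response definition follows the description above):

```python
import pandas as pd

# Toy rows standing in for the MVC data.
df = pd.DataFrame({
    "CRASH DATE": ["2019-05-01", "2019-05-02"],
    "CRASH TIME": ["13:45", "9:10"],
    "NUMBER OF PERSONS INJURED": [0, 2],
    "NUMBER OF PERSONS KILLED": [0, 0],
})

dt = pd.to_datetime(df["CRASH DATE"] + " " + df["CRASH TIME"])
df["Year"], df["Month"], df["Day"] = dt.dt.year, dt.dt.month, dt.dt.day
df["Hour"], df["Minute"] = dt.dt.hour, dt.dt.minute
df["Day of week"] = dt.dt.dayofweek

# Response = 1 when anyone was injured or killed in the collision.
casualties = df["NUMBER OF PERSONS INJURED"] + df["NUMBER OF PERSONS KILLED"]
df["Response"] = (casualties > 0).astype(int)
```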

Drop uncompleted years:

Finally, we drop the years 2012 and 2021 because they are incomplete.

We see that we lose only 7.4% of the data.


Adding new Datasets:


In this section we merge in information about the speed limits of NYC streets and the weather. These will hopefully improve the forecasting in the Machine Learning part!

Adding Speed_Limits Mode Data:

For the speed limits, we take as the limit the mode of the (mph) limits on each avenue. The processing is as follows:

  1. Drop missing values on streets and limits
  2. Use .lower() and .strip() on both datasets (MVC - Street Speed Limits) to do the merging
  3. See the number of matched Streets
  4. Calculate the mode speed for the matched Streets
  5. Measure the data reduction
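Steps 2 and 4 above can be sketched like this (toy frames; the speed-limit column name "mph" is illustrative, not the dataset's actual column name):

```python
import pandas as pd

# Toy stand-ins for the MVC and speed-limit datasets.
mvc = pd.DataFrame({"ON STREET NAME": [" Broadway ", "5 Avenue"]})
limits = pd.DataFrame({
    "street": ["Broadway", "broadway ", "amsterdam ave"],
    "mph": [25, 30, 25],
})

# Step 2: normalise street names on both sides before matching.
mvc["street_key"] = mvc["ON STREET NAME"].str.lower().str.strip()
limits["street_key"] = limits["street"].str.lower().str.strip()

# Step 4: mode speed per street (a street can carry several limits).
mode_limits = (limits.groupby("street_key")["mph"]
               .agg(lambda s: s.mode().iloc[0])
               .rename("SPEED LIMIT MODE")
               .reset_index())
merged = mvc.merge(mode_limits, on="street_key", how="inner")
```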

We realise that we lose only 5.4% of the data. Super nice!

Adding weather data:

To add weather data we prepare **date, wind, rain, snow, fog/vision**.

Lastly, we add temperature and wind speed for self-explanatory reasons. We end up with the following features:


Save final Data:


We save the final dataset:


Clear All Variables:


To free up memory space, clear all variables!

Part 02 - Data Analysis 01


Motivation


What is your dataset?

NYC Motor Vehicle Collisions - Crashes[link](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95). It's freely available and has well-defined spatio-temporal information, as well as casualty- and damage-related features. Additionally, we will also consider the corresponding Weather[link](https://www.ncdc.noaa.gov/cdo-web/search) and Speed-Limit Data[link](https://data.cityofnewyork.us/Transportation/VZV_Speed-Limits/7n5j-865y).

Why did you choose this/these particular dataset(s)?

Vehicle crashes happen daily around the globe. They cost the New York City economy, for example, an enormous $4 billion per year [link](https://nypost.com/2015/03/20car-accidents-cost-nyc-nearly-4-billion-a-year/). Thus, it might be beneficial to investigate the chosen data to learn more about this phenomenon and analyse the core reasons and contributing factors behind those accidents.

What was your goal for the end user's experience?

To give the end user the ability to investigate the data in an interactive way, where they can learn and build their own assumptions about this phenomenon based on strong statistical analysis and visualizations.

Terminology:

The project is focused on the Response variable, which indicates whether there is an injury/death or not; rephrasing it, whether there was a serious accident or not.


Genre:


Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segel and Heer). Why?

For the Visual Narrative, and according to the Segel and Heer paper (Fig. 7), the following tools are used from each of the 3 categories Visual Structuring, Highlighting and Transition Guidance:

  1. Regarding Visual Structuring, "the mechanisms that communicate the overall structure of the narrative to the viewer and allow him to identify his position within the larger organization of the visualization", the tool used visually on the site is the Progress bar/Timebar. The progress is summarized in the subsections "Basic Statistics", "Feature Investigation", "Response Investigation" and "Vehicle Types & Contributing Factors". Specifically, the storytelling is summarized in bubbles on the site that have a progress/continuity. Users can freely click the bubble they want to elaborate on, and the analysis pops up.
  2. For Highlighting, "the visual mechanisms that help direct the viewer's attention to particular elements in the display", we use Feature Distinction, where we portray key features for the Data Exploration and Machine Learning parts. For the map plots we have Zooming. We wanted readers to be able to investigate the collisions themselves and follow their own curiosity.
  3. For Transition Guidance, "move within or between visual scenes without disorienting the viewer", we use Object Continuity.

Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segel and Heer). Why?

For the Narrative Structure, the following tools are used from each of the 3 categories Ordering, Interactivity and Messaging:

  1. For Ordering, "the ways of arranging the path viewers take through the visualization", we use a User Directed Path, where users are free to navigate in the way they prefer. We wanted users to have multiple ways of navigating and to easily start from the topics they want.
  2. For Interactivity, "the different ways a user can manipulate the visualization and how the user learns those methods", a list of tools is used. In our Bokeh plot we have Hover Highlighting / Details and Filtering / Selection / Search. The freedom these tools provide allows users to investigate their preferred vehicle types and contributing factors.
  3. In Messaging, "the ways a visualization communicates observations and commentary to the viewer", we use Captions / Headlines and Summary, as it is an organized way of depicting the different insights and summarizing conclusions.

Import needed libraries:



Load Final Data:



Basic Stats:


In this section some basic stats are presented and some key findings are depicted. These findings are visualized in the next section.

Summary Statistics:

Key insights from categorical features:

Key insights from numerical features:

The findings relate to the visualization depicted later on.

Some interesting counts:


Data Analysis and Visulization:


In this section we depict different plots that help us visually understand the data. Comments are written alongside each plot.

Box and whisker plots

image info

Comments:

Above we see the box plots for the Motor Vehicle Collisions (MVC) data;

image info

Comments:

Above we see the box plots for the Speed Limits as well as the Weather data;

  1. The median speed is 25 MPH.
  2. Usually we have no precipitation (rain) or snow.
  3. The median maximum temperature is around 60 Fahrenheit and the median minimum is around 50.

Response over Years Plot

Comments:

Above we see the Responses (non-serious vs. serious accidents) throughout the years 2013 to 2020.

Response over Borough Plot

Comments:

Above we see the Response, meaning the serious (in red) vs. the non-serious (in blue) accidents, per borough.

Response over 24 Hours Plot

Correlation Matrix Plot

Comments:

Above we see the correlation matrix of our features. Key insights;

Jitter-plots:

Comments:

The figure above shows a jitter-plot of New York City collisions in January 2013-2020, with and without injuries/deaths, between 13.00 and 14.00.

The pattern of collisions without injuries/deaths is much busier than that of the lethal car accidents. This makes sense.

We also observe a high number of accidents registered at 10-minute intervals (the officers rounded the times to simplify them), but there are also lethal accidents recorded at the exact minute.

Histogram-Plots:

Comments:

The above histograms show the latitude distribution of the two responses in January only, from 2013 until 2020, using 50 bins.

It can be seen that the two response histograms show mixed Gaussian spatio-temporal distributions (January & latitude), excluding the high-peak noise/outliers.

Map-plot:

Comments:

The above field gives the user the ability to pick a collision type and display its incidents on the map for a selected period, specified with a Start and End Date.

We chose to visualize the serious/lethal collisions with red and the non-serious collisions with blue in New York City for the month of January 2020.

We observe that Manhattan in general, and the Central Park area in particular, is very dense in collisions, and the same holds for Brooklyn.

We also see that the number of non-serious collisions (blue) is much larger than that of the serious/lethal ones.

Bokeh-Plot:

Define a general Bokeh-plot function, for all Contributing Factors and Vehicle Types:

Bokeh-plot: Vehicle Type Code 1

Comments:

The reader can investigate the interactive Bokeh plot depicting the hourly distribution of vehicle type code 1.

Key insights;

Bokeh-plot: Vehicle Type Code 2

Comments:

The reader can investigate the interactive Bokeh plot depicting the hourly distribution of vehicle type code 2.

Bokeh-plot: Contributing Factor Vehicle 1

Comments:

The reader can investigate the interactive Bokeh plot depicting the hourly distribution of contributing factor vehicle 1.

Bokeh-plot: Contributing Factor Vehicle 2

Comments:

The reader can investigate the interactive Bokeh plot depicting the hourly distribution of contributing factor vehicle 2.


Clear All Variables:



Part 03 - Data Analysis 02 (Machine Learning).



Import Needed Libraries and Set Seed:


First we start by importing needed Libraries


Load Data:



Testing for different spatio-temporal distribution:


The Kolmogorov-Smirnov statistic of the two response values is tested on the attributes 'LATITUDE', 'Hour' and 'Month'.

Specifically, the two-sample Kolmogorov-Smirnov test is performed to check whether the response labels are drawn from the same distribution under the significance level $α = 0.05$. If the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same.
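A minimal sketch of the two-sample KS test with SciPy (synthetic latitude samples stand in for the two response classes):

```python
import numpy as np
from scipy import stats

# Synthetic latitude samples for the two response classes (not real data).
rng = np.random.default_rng(0)
lat_serious = rng.normal(40.70, 0.05, 500)
lat_non_serious = rng.normal(40.75, 0.05, 500)

res = stats.ks_2samp(lat_serious, lat_non_serious)
alpha = 0.05
# Reject "same distribution" when the p-value falls below alpha.
different = res.pvalue < alpha
```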

Above are histograms that show the latitude distribution of the two 'Response' labels, in only January from 2013 until 2021 using 50 bins. It can be seen that the two crimes histograms shows different spatio-temporal distributions.


Slice needed features for learning:


Below we slice the needed features for learning:

Y variable: 'Response'

X-Matrix: 
    Time features: 'Month', 'Day of week', 'Hour' 
    Place features: 'ON STREET NAME'
    Vehicle features: 'VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2'
    Casual features: 'CONTRIBUTING FACTOR VEHICLE 1', 'CONTRIBUTING FACTOR VEHICLE 2'
    Speed Features: 'SPEED LIMIT MODE'
    Weather features: 'PRECIPITATION', 'SNOW FALL', 'SNOW DEPTH','FOG, SMOKE OR HAZE', 'AVERAGE WIND SPEED', 'MAXIMUM TEMPERATURE', 'MINIMUM TEMPERATURE'.

We dropped the 'LATITUDE', 'LONGITUDE', 'BOROUGH' and 'ZIP CODE' since we have 'ON STREET NAME'. On the other hand, combining the three features Month, DayOfWeek and Hour can best describe the time-related information.


Class Balance:


We balance the classes because we don't want to increase the probability of classifying any new random point as the majority class (Response = 0), the one with more occurrences.

Here we will use the down-sample-majority-class method (see link).
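Down-sampling the majority class can be sketched with pandas like this (toy frame; the notebook may use a different helper):

```python
import pandas as pd

# Toy imbalanced frame: 8 non-serious (0) vs. 3 serious (1) collisions.
df = pd.DataFrame({"Response": [0] * 8 + [1] * 3, "Hour": range(11)})

minority_n = df["Response"].value_counts().min()
# Down-sample every class to the minority-class size.
balanced = (df.groupby("Response", group_keys=False)
              .apply(lambda g: g.sample(n=minority_n, random_state=42)))
```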


Define Data Preprocessor:


The preprocessor below will be used as the first step in the pipeline models for data processing. Defining it this way allows us to use it in each cross-validation fold, which prevents leaking information from the training data into the test data.
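A sketch of such a preprocessor with scikit-learn's ColumnTransformer, assuming the feature lists from the slicing section (the exact names are illustrative):

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Feature lists mirror the slicing section; names are illustrative.
categorical = ["VEHICLE TYPE CODE 1", "VEHICLE TYPE CODE 2",
               "CONTRIBUTING FACTOR VEHICLE 1",
               "CONTRIBUTING FACTOR VEHICLE 2", "ON STREET NAME"]
numerical = ["Month", "Day of week", "Hour", "SPEED LIMIT MODE",
             "PRECIPITATION", "MAXIMUM TEMPERATURE", "MINIMUM TEMPERATURE"]

preprocessor = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numerical),
])

# Inside the pipeline, the preprocessor is re-fitted on each CV training
# fold only, so no information leaks into the held-out fold.
model = Pipeline([("prep", preprocessor),
                  ("clf", DecisionTreeClassifier(random_state=42))])
```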


Split data for Learning:



Compare Algorithms:


Here we compare the performance of the three algorithms we have been introduced to, Random Forest, Decision Tree and Logistic Regression, on the training data (all with sklearn default settings) using 5-fold stratified CV.

We chose recall (also called sensitivity) as the scoring metric since it is the true positive rate: the fraction of instances from the positive class (Response = 1) that are actually predicted correctly, which is what we are interested in.
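The comparison can be sketched as follows (synthetic data stands in for the balanced training set):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the balanced training set.
X, y = make_classification(n_samples=300, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, clf in [("DecisionTree", DecisionTreeClassifier(random_state=42)),
                  ("RandomForest", RandomForestClassifier(random_state=42)),
                  ("LogReg", LogisticRegression(max_iter=1000))]:
    # Score with recall on the positive class, as argued above.
    scores = cross_val_score(clf, X, y, cv=cv, scoring="recall")
    results[name] = scores.mean()
```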

Above are the box-and-whisker plots comparing the performance of the three algorithms, Decision Tree, Random Forest and Logistic Regression. It can be seen that the Decision Tree outperforms the others on sklearn default settings. Therefore it is worth tuning the Decision Tree and investigating it further.


Tuning:


Decision Tree tuning:

By means of grid-search CV we are going to investigate:
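A sketch of the grid search (the hyperparameter grid here is illustrative, not necessarily the one used in the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, random_state=42)

# Illustrative grid; the notebook's actual grid may differ.
param_grid = {"criterion": ["gini", "entropy"],
              "max_depth": [5, 10, None],
              "min_samples_split": [2, 10]}

grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, scoring="recall",
                    cv=StratifiedKFold(5, shuffle=True, random_state=42))
grid.fit(X, y)
best_params = grid.best_params_  # hyperparameters of the best recall model
```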


Report:


Here we save the Final Model with the best hyperparameters found in the last grid search.

Finally, it's time to test the Final Model on the validation (unseen) data and report the results. See the Discussion for the results analysis.


Feature importance:


Here we run a 5-fold CV to check feature importance. This can be useful not only to check feature importance but also to improve the model, or to arrive at a simpler model that has relatively reasonable performance.
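Per-fold feature importances with a decision tree can be sketched like this (synthetic data; the real run uses the preprocessed MVC features):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the preprocessed MVC features.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_importances = []
for train_idx, _ in cv.split(X, y):
    clf = DecisionTreeClassifier(random_state=42)
    clf.fit(X[train_idx], y[train_idx])
    fold_importances.append(clf.feature_importances_)

# Average impurity-based importance over the 5 folds.
mean_importance = np.mean(fold_importances, axis=0)
```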


Tree Plot:


Here we are going to plot the Final Model tree, but for visualization purposes we will set the maximum depth to 2.
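The truncated tree drawing can be sketched like this (synthetic data stands in for the tuned Final Model; `max_depth=2` only limits the rendering, not the fitted tree):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in for the tuned Final Model.
X, y = make_classification(n_samples=200, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X, y)

fig, ax = plt.subplots(figsize=(10, 6))
# max_depth=2 truncates only the drawing, not the fitted tree.
annotations = plot_tree(clf, max_depth=2, filled=True, ax=ax)
```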

A little description of the information at each plotted node


Machine Learning Discussion:


Explain your choices for training/test data, features, and encoding. (You decide how to present your results, but here are some example topics to consider: Did you balance the training data? What are the pros/cons of balancing? Do you think your model is overfitting? Did you choose to do cross-validation? Which specific features did you end up using? Why? Which features (if any) did you one-hot encode? Why ... or why not?)

Below is a summary of the algorithm we used to approach solving the classification problem at hand:

Below is a clarification of the choices we made:


Part 04 - Discussion & Contribution:


Discussion

We have investigated the NYC Motor Vehicle Collisions - Crashes[link](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95) dataset and the related Weather[link](https://www.ncdc.noaa.gov/cdo-web/search) and Speed Limit data[link](https://data.cityofnewyork.us/Transportation/VZV_Speed-Limits/7n5j-865y). The analysis gives strong scientific insight into the core reasons and contributing factors behind those accidents.

It was noticed for example:

This information can be used to help prevent dangerous MVCs, i.e. those in which at least one injury happens.

What is still missing? What could be improved? Why?

In general, everything was under control. We managed to process the data and add new datasets without losing too many observations. We have achieved a really nice machine learning model that predicts whether an MVC is dangerous or not, with a high recall (on large unseen data) focusing on dangerous accidents.

What can be improved

It would be nice if we had access to better computational resources for the machine learning part. With those, we could train our model on more data and thus obtain a better model, which we could use not only for prediction but also to identify the important features for predicting dangerous accidents.

Contribution:

All the group members were involved in all the tasks of this assignment! This gave us the chance not only to work on the tasks that interested each group member the most, but also to be introduced to all the tasks.

Each task had a main responsible member who took a lead role. The following table shows who was mainly responsible for the different tasks in this project:

| Task | Main Responsible |
| --- | --- |
| Initialize the Idea | Efstathios - s203021 |
| Finding datasets | Asterios - s202242 |
| Data Preprocessing | AbdulStar - s174360 |
| Data Analysis & Visualization | AbdulStar - s174360 |
| Machine Learning | AbdulStar - s174360 |
| Define the Genre | Asterios - s202242 |
| Website | Efstathios - s203021 |
| Commenting Notebook & Figures | Asterios - s202242 |